REMARKS

Claims 1-20 were originally filed and were subject to a Restriction Requirement. Applicants
affirm the election, with traverse, of original claims 1-6, corresponding to the invention of Group I.

Justification for the amendments is as follows. The specification has been amended to delete
reference to certain web sites recited in the application. No new matter is added by any of these
amendments.



The Examiner noted that the Brief Description of the Figures and Table on page 5 of the
specification contains a referral to Tables 1 and 2. However, the instant specification is missing the
tables. Applicants submit that Tables 1 and 2 were indeed submitted with the application as originally
filed, as pages 38 and 39. Enclosed is a copy of the Return Postcard clearly showing the referenced
Tables 1 and 2 as pages 38 and 39 of the specification. For the Examiner's convenience, a copy of
Tables 1 and 2, as originally filed, is also attached to this response.

The disclosure is objected to because it contains an embedded hyperlink and/or other form of
browser-executable code. See p. 28, line 10 and p. 29, line 21, for example. Applicant is required to
delete the embedded hyperlink and/or other form of browser-executable code. See MPEP § 608.01.

Applicants submit that the MPEP states at § 608.01 that this policy is based on the principle
that "USPTO policy does not permit the USPTO to link to any commercial sites since the USPTO
exercises no control over the organization, views or accuracy of the information contained on those
outside sites" (underline added). Section 608.01 goes on to state that "where hyperlinks and/or other
forms of browser-executable code are a part of the applicant's invention and it is necessary to have
them included in the patent application in order to comply with the requirements of 35 U.S.C. 112, first
paragraph, and applicant does not intend to have these hyperlinks be active links, examiners should
not object to these hyperlinks. The Office will disable these hyperlinks when preparing the text to be
loaded onto the USPTO web database" (underline added). Applicants point out that the cited website
is a non-commercial, government web site which should not be subject to the requirements of MPEP §
608.01. However, this citation, as well as a second at page 33, line 26, has been deleted. Withdrawal
of the objection is therefore requested.
35 U.S.C. § 101, Rejection of Claims 1-6

The Examiner has rejected claims 1-6 under 35 U.S.C. § 101 because it is drawn to an 
invention with no apparent or disclosed specific and substantial credible utility. The instant application 
has provided a description of an isolated DNA encoding a protein and the protein encoded thereby. 
The instant application does not disclose the biological role of this protein or its significance. The 
Examiner stated that it is clear from the instant application that the protein described is what is termed 
an "orphan protein" in the art and that there is little doubt that after complete characterization this DNA 
and encoded protein may be found to have a specific and substantial utility, however, that this further 
characterization is part of the act of invention. See Brenner v. Manson, 148 USPQ 689 (Sup. Ct.
1966).

The Examiner reiterated Applicants' characterization of the human TIMM8b, to which the instant
protein, TRP, bears sequence homology (85% amino acid identity), and its association with various
neurodegenerative and neuromuscular diseases involving defects in oxidative phosphorylation. The
Examiner, however, cited various publications regarding an alleged uncertainty in the art in predicting
protein function based on structure (Skolnick et al. (2000) and Bork et al. (1998)) and concluded that
the function of TRP could not unequivocally be extrapolated from its structural characteristics (underline
added). The Examiner concluded that the instant specification fails to provide any evidence or sound 
scientific reasoning that would support a conclusion that the instant nucleic acid or encoded protein is 
associated with any diseases or disorders. 

Applicants disagree that the claimed invention is not supported by either a well-established
utility or a specific and substantial asserted utility. The claimed invention is in fact supported by both a
well-established utility and a specific and substantial asserted utility.

TRP is supported by a well-established utility based on its structural and functional identity with
TIMM8b, disclosed in the specification as a mitochondrial protein involved in neurodegenerative
disorders, such as Mohr-Tranebjaerg syndrome. See BACKGROUND, pp. 1-2 and Paschen et al.
(2000). The identification of TRP as a TIMM8b protein is based on a high level of sequence identity to
TIMM8b. In addition to an overall sequence identity of 85% with TIMM8b, the sequence alignment
presented in Figure 2 clearly shows that TRP is 100% identical with TIMM8b over 82% of the
sequence of TRP, differing only by a 15 amino acid insert at the N-terminal end of the molecule. It is
well known in the art that such N-terminal sequences are most likely signal peptides related to protein
secretion or subcellular localization rather than to altering function. In fact, as described in the
specification at p. 9, lines 30-31, TRP retains (100% identical) the CX3CXnCX3C motif of
mitochondrial import proteins, which is characteristic of DDP/TIM family proteins. Thus there is a
substantial likelihood that TRP is functionally as well as structurally related to other DDP/TIM family
proteins, such as TIMM8b. In addition, it is well known that the probability that two unrelated
polypeptides share more than 40% sequence homology over 70 amino acid residues is exceedingly
small. Brenner et al., Proc. Natl. Acad. Sci. 95:6073-78 (1998). Given homology in substantial
excess of 40% over many more than 70 amino acid residues, including the conservation of functional 
motifs, the probability that the polypeptide encoded for by the claimed polynucleotide is related to 
TIMM8b is, accordingly, very high. The Examiner must accept the applicants' demonstration that the 
homology between the polypeptide encoded for by the claimed invention and TIMM8b demonstrates 
utility by a reasonable probability unless the Examiner can demonstrate through evidence or sound 
scientific reasoning that a person of ordinary skill in the art would doubt utility. See In re Langer, 503
F.2d 1380, 1391-92, 183 USPQ 288 (CCPA 1974). The Examiner has not provided sufficient
evidence or sound scientific reasoning to the contrary. While the Examiner has cited literature 
identifying some of the difficulties that may be involved in predicting protein function, none suggests that 
functional homology cannot be inferred by a reasonable probability in this case. See Skolnick et al. and
Bork et al., Office Action, p. 5. Most important, none contradicts Brenner's basic rule that sequence
homology in excess of 40% over 70 or more amino acid residues yields a high probability of functional
homology as well. Nor do they contradict the significance of the CX3CXnCX3C motif retained in
TRP. At most, these articles individually and together stand for the proposition that it is difficult to
make predictions about function with certainty. The standard applicable in this case is not, however,
proof to certainty (or unequivocal proof), but rather proof to reasonable probability.

In addition, the claimed polynucleotide is also supported by a specific and substantial asserted 
utility that is independent of any knowledge of the encoded protein. This utility is disclosed in the 
specification at p. 17, lines 4-7, where it is stated "The cDNAs, fragments, oligonucleotides,
complementary RNA and DNA molecules, and PNAs and may be used to detect and quantify 
differential gene expression for diagnosis of a disorder. Disorders associated with differential 
expression include cancer, particularly breast cancer, ovarian cancer, and kidney cancer-". This 
asserted utility of polynucleotides encoding TRP in the diagnosis of breast, ovarian, and kidney cancer 
is supported in the specification at p. 9, in the paragraph beginning at line 9 and Table 2, where it is 
shown that SEQ ID NO:2 shows overexpression in a breast tumor library (BRSTTUT14) compared 
with microscopically normal breast tissue from the same donor (BRSTNOT14) in which no expression 
of the transcript was detectable. Similarly, SEQ ID NO:2 was overexpressed in two kidney tumor
libraries (KIDNTUT15 and KIDNTUT14) compared to libraries (KIDNNOT19 and KIDNNOT20) 
from matched (m) microscopically normal tissue from the same donors. SEQ ID NO:2 was also 
overexpressed in two ovarian tumor libraries (OVARTUP02 and OVARTUT03). Thus the claimed
polynucleotide is useful in the detection and diagnosis of breast, ovarian, and kidney cancers 
independent of any knowledge of the polypeptide encoded by the polynucleotide. 

For all of the above reasons, Applicants believe the claimed invention is well supported by both 
a well established utility as a functional homolog of TIMM8b in the diagnosis of neurodegenerative 
disorders, such as Mohr-Tranebjaerg syndrome, as well as a specific and substantial asserted utility in 
the diagnosis of breast, ovarian, and kidney cancers. Withdrawal of the rejection of claims 1-6 under 
35 U.S.C. § 101 is therefore respectfully requested. 

35 U.S.C. § 112, First Paragraph, Rejection of Claims 1-6

The Examiner has also rejected claims 1-6 under 35 U.S.C. § 112, first paragraph, specifically,
since the claimed invention is not supported by either a clear asserted utility or a well established utility 
for the reasons set forth above, one skilled in the art would clearly not know how to use the invention. 
The rejection of claims 1-6 under 35 U.S.C. § 101 for lack of utility has been addressed above. Since
the claimed invention is supported by a well-established utility, as well as a specific and substantial
asserted utility under 35 U.S.C. § 101, one skilled in the art would know how to use the claimed
invention. Withdrawal of the rejection of claims 1-6 under 35 U.S.C. § 112, first paragraph is therefore
requested.

The Examiner has rejected claims 2 and 6 under 35 U.S.C. § 112, second paragraph, as being
indefinite for failing to particularly point out and distinctly claim the subject matter which applicant
[...] recitation of "a fragment" of SEQ ID NO:5. Although according to the definition presented in the
[...] base pairs in length", a general meaning of a fragment [...] that it is not clear how a sequence of
SEQ ID NO:2, which [...] nucleotides long, can have a fragment that is a sequence of SEQ ID NO:5,
which is 598 nucleotides long. Clarification is required.

[...] EST fragments of full-length consensus sequences, such as the instant SEQ ID NO:2, [...]
therefore a fragment of SEQ ID NO:[...], from which the consensus sequence is derived.

The Examiner also stated that claim 6 is indefinite and ambiguous for [...] produced by the host
cell of claim 5. It is not clear which protein produced by a host cell is intended [...] sequences of a
nucleic acid complementary to the nucleic acid encoding SEQ ID NO:1.

Both claims 4 and 5 indeed do depend from claim 1 and are therefore subject to the limitations
of the polynucleotides recited in claim 1, e.g., a polynucleotide encoding SEQ ID NO:1 or its complete
complement. Since the complementary sequence to a polynucleotide encoding an open reading frame can,
itself, encode a protein, it is customary to include it in any expression system for expressing a protein, such
as that recited in claim 6. With these explanations, Applicants believe claims 2 and 6 are clear and
definite, and respectfully request withdrawal of the rejection of these claims under 35 U.S.C. § 112,
second paragraph.
CONCLUSION

In light of the above amendments and remarks, Applicants submit that the present application is 
fully in condition for allowance, and request that the Examiner withdraw the outstanding rejections. 
Early notice to that effect is earnestly solicited. Applicants further request that, upon allowance of 
claim 1, claims 7-12 be rejoined and examined as methods of use of the polynucleotides of claim 1 that 
depend from and are of the same scope as claim 1 in accordance with Ochiai and Brouwer. See 
MPEP § 821.04 and the Commissioner's Notice in the Official Gazette of March 26, 1996. 

If the Examiner contemplates other action, or if a telephone conference would expedite 
allowance of the claims, Applicants invite the Examiner to contact Applicants' agent of Record, below. 

Applicants believe that no fee is due with this communication. However, if the USPTO 
determines that a fee is due, the Commissioner is hereby authorized to charge Deposit Account No. 
09-0108. 



Respectfully submitted, 
INCYTE GENOMICS, INC. 

Date: [illegible]

David G. Streeter, Ph.D.

Reg. No. 43,168 

Direct Dial Telephone: (650) 845-5741 

3160 Porter Drive 
Palo Alto, California 94304 
Phone: (650) 855-0555 
Fax: (650) 849-8886 
VERSION WITH MARKINGS TO SHOW CHANGES MADE



IN THE SPECIFICATION 



Paragraph beginning at line 10 of page 28 has been amended as follows: 

The BLAST software suite (NCBI, Bethesda MD[; 
http://www.ncbi.nlm.nih.gov/gorf/bl2.html]), includes various sequence analysis programs including 
"blastn" that is used to align nucleotide sequences and BLAST2 that is used for direct pairwise 
comparison of either nucleotide or amino acid sequences. BLAST programs are commonly used with 
gap and other parameters set to default settings, e.g.: Matrix: BLOSUM62; Reward for match: 1; 
Penalty for mismatch: -2; Open Gap: 5 and Extension Gap: 2 penalties; Gap x drop-off: 50; Expect: 10; 
Word Size: 11; and Filter: on. Identity is measured over the entire length of a sequence. Brenner et al.
(1998; Proc Natl Acad Sci 95:6073-6078, incorporated herein by reference) analyzed BLAST for its
ability to identify structural homologs by sequence identity and found 30% identity is a reliable threshold
for sequence alignments of at least 150 residues and 40% for alignments of at least 70 residues.

Paragraph beginning at line 15 of page 29 has been amended as follows: 
Following assembly, templates were subjected to BLAST, motif, and other functional analyses 
and categorized in protein hierarchies using methods described in USSN 08/812,290 and USSN 
08/811,758, both filed March 6, 1997; in USSN 08/947,845, filed October 9, 1997; and in USSN
09/034,807, filed March 4, 1998. Then templates were analyzed by translating each template in all 
three forward reading frames and searching each translation against the PFAM database of hidden 
Markov model-based protein families and domains using the HMMER software package (Washington 
University School of Medicine, St. Louis MO[; http://pfam.wustl.edu/]). The cDNA was further 
analyzed using MACDNASIS PRO software (Hitachi Software Engineering), and LASERGENE 
software (DNASTAR) and queried against public databases such as the GenBank rodent, mammalian, 
vertebrate, prokaryote, and eukaryote databases, SwissProt, BLOCKS, PRINTS, PFAM, and 
Prosite. 



101420 



10 



09/781,117 



Received from <415 852 0195> at 10/10/02 8:33:19 PM [Eastern Daylight Time]



Mailed: February 8, 2001
Express Mail No.: [illegible]
Docket No.: [illegible]

COMMISSIONER FOR PATENTS
BOX PATENT APPLICATION
WASHINGTON, D.C. 20231

Title: TIMM8b-RELATED PROTEIN

Filing Date: Herewith

Enclosed:

X Return Receipt Postcard
X Transmittal for Patent Application (1 page, in duplicate)
X Submission of Sequence Listing (1 page)
X One Computer-readable Diskette
X 37 Pages of Specification (1-37);
X 2 Pages of Tables (Table 1-2) (38-39);
X 2 Pages of Claims (40-41);
X 1 Page of Abstract (42);
X 3 Sheets of Figures (1A, 1B, and 2)
X 5 Pages of Sequence Listing (1-5); and
X 3 Pages - Unexecuted Declaration and Power of Attorney.
DGS/ki









Tissue Category             Clone Count    Found in        Abs Abund      Pct Abund

Cardiovascular System       256190         17/63           32             0.0120
Connective Tissue           [illegible]    [illegible]     11             0.0075
Digestive System            501101         23/148          34             0.0068
Embryonic Structures        [illegible]    5/21            17             0.0159
Endocrine System            [illegible]    14/[illegible]  [illegible]    0.0089
Exocrine Glands             254635         15/64           21             0.0082
Reproductive, Female        [illegible]    [illegible]     [illegible]    [illegible]
Reproductive, Male          [illegible]    [illegible]     [illegible]    0.0071
[illegible]                 33282          3/5             5              0.0131
Hemic and Immune System     680277         35/159          55             0.0031
Liver                       109378         7/35            17             0.0155
Musculoskeletal System      159260         9/47            11             0.0069
Nervous System              955753         55/198          76             0.0080
Pancreas                    110207         1/24            1              0.0009
Respiratory System          390086         23/93           33             0.0085
Sense Organs                19256          0/9             0              0.0000
Skin                        72292          3/15            4              0.0055
Stomatognathic System       12923          1/10            3              0.0232
Unclassified/Mixed          120926         3/13            13             0.0108
Urinary Tract               279062         10/54           19             0.0068
Totals                      5321833        278/1292        432            0.0081



TABLE 1 



38 




[TABLE 2: contents illegible in the scanned copy]



39 




Proc. Natl. Acad. Sci. USA
Biochemistry

Assessing sequence comparison methods with reliable structurally
identified distant evolutionary relationships

STEVEN E. BRENNER, CYRUS CHOTHIA, AND TIM J. P. HUBBARD

[...] Cambridge CB10 1SA, United Kingdom



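ABSTRACT Pairwise sequence comparison methods have been assessed using proteins whose
relationships are known reliably from their structures and functions, as described in the SCOP
database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247,
536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W.,
Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. &
Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988)
Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981)
J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of all algorithms is greatly
reduced by using statistical scores to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false
positives found in our tests agrees well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA
ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships
between proteins whose sequence identities are >30%. For more distantly related proteins, they do
much less well; only one-half of the relationships between proteins with 20-30% identity are found.
Because many homologs have low sequence similarity, most distant relationships cannot be detected
by any pairwise comparison method; however, those which are identified may be used with confidence.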

Sequence database searching plays a role in virtually every branch of molecular biology and is
crucial for interpreting the sequences issuing forth from genome projects. Given the method's
central role, it is surprising that overall and relative capabilities of different procedures are
largely unknown. It is difficult to verify algorithms on sample data because this requires large
data sets of proteins whose evolutionary relationships are known unambiguously and independently of
the methods being evaluated. However, nearly all known homologs have been identified by sequence
analysis (the method to be tested). Also, it is generally very difficult to know, in the absence of
structural data, whether two proteins that lack clear sequence similarity are unrelated. This has
meant that although previous evaluations have helped improve sequence comparison, they have suffered
from insufficient, imperfectly characterized, or artificial test data. Assessment also has been
problematic because high quality database sequence searching attempts to have both sensitivity
(detection of homologs) and specificity (rejection of unrelated proteins); however, these
complementary goals are linked such that increasing one causes the other to be reduced.

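The publication costs of this article were defrayed in part by page charge payment. This article
must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to
indicate this fact.
© 1998 by The National Academy of Sciences
PNAS is available online at http://www.pnas.org.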



Sequence comparison methodologies have evolved rapidly, so no previously published tests have
evaluated modern versions of programs commonly used. For example, parameters in BLAST (1) have
changed, and WU-BLAST2 (2), which produces gapped alignments, has become available. The latest
version of FASTA (3) previously tested was 1.6, but the current release (version 3.0) provides
fundamentally different results in the form of statistical scoring.

The previous reports also have left gaps in our knowledge. For example, there has been no
published assessment of thresholds for scoring schemes more sophisticated than percentage
identity. Thus, the widely discussed statistical measures have never actually been evaluated on
large databases of real proteins. Moreover, the different scoring schemes commonly in use have not
been compared.

Beyond these issues, there is a more fundamental question: in an absolute sense, how well does
pairwise sequence comparison work? That is, what fraction of homologous proteins can be detected
using modern database searching methods?

In this work, we attempt to answer these questions and to overcome both of the fundamental
difficulties that have hindered assessment of sequence comparison methodologies. First, we use the
set of distant evolutionary relationships in the SCOP: Structural Classification of Proteins
database (4), which is derived from structural and functional characteristics (5). The SCOP
database provides a uniquely reliable set of homologs, which are known independently of sequence
comparison. Second, we use an assessment method that jointly measures both sensitivity and
specificity. This method allows straightforward comparison of different sequence comparison
procedures. Further, it can be used to aid interpretation of real database searches and thus
provide optimal and reliable results.

Previous Assessments of Sequence Comparison. Several previous studies have examined the relative
performance of different sequence comparison methods. The most encompassing analyses have been by
Pearson, who compared the three most commonly used programs. Of these, the Smith-Waterman
algorithm (8) implemented in SSEARCH (3) is the oldest and slowest but the most rigorous. Modern
heuristics have provided BLAST (1) the speed and convenience to make it the most popular program.
Intermediate between these two is FASTA (3), which may be run in two modes offering either greater
speed (ktup = 2) or greater effectiveness (ktup = 1). Pearson also considered different parameters
for each of these programs.

To test the methods, Pearson selected two representative proteins from each of 67 protein
superfamilies defined by the PIR database (9). Each was used as a query to search the database,
and the matched proteins were marked as being homologous or unrelated according to their membership
of PIR superfamilies. Pearson found that modern matrices and "ln-scaling" of raw scores improve
results considerably. He also reported that the rigorous Smith-Waterman algorithm worked slightly
better than FASTA, which was in turn more effective than BLAST.

[...] University, Fairchild Building D-109, Stanford, CA [...].
To whom reprint requests should be addressed. e-mail: brenner@hyper.stanford.edu.

Very large scale analyses of matrices have been performed
(10), and Henikoff and Henikoff (11) also evaluated the
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a predetermined
score but had no penalty for methods which also
reported large numbers of spurious matches. The Henikoffs
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the
BLOSUM62 matrix (14) performed markedly better than the
extrapolated PAM-series matrices (15), which previously had
been popular.

A crucial aspect of any assessment is the data that are used
to test the ability of the program to find homologs. But in
Pearson's and the Henikoffs' evaluations of sequence comparison,
the correct results were effectively unknown. This is
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods
which are being evaluated. Interdependency of data and
methods creates a "chicken and egg" problem, and means, for
example, that new methods would be penalized for correctly
identifying homology missed by older programs. For instance,
immunoglobulin variable and constant domains are clearly
homologous, but PIR places them in different superfamilies.
The problem is widespread: each superfamily in PIR with
a structural homolog is itself homologous to an average of 1.6
other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis 
was the HSSP equation; it states that proteins with 25% identity 
over 80 residues will have similar structures, whereas shorter 
alignments require higher identity. (Other studies also have
used structures (16-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 

A general solution to the problem of scoring comes from
statistical measures (i.e., E-values and P-values) based on the
extreme value distribution (21). Extreme value scoring was
implemented analytically in the BLAST program using the
Karlin and Altschul statistics (22, 23) and empirical approaches
have been recently added to FASTA and SSEARCH. In
addition to being heralded as a reliable means of recognizing
significantly similar proteins (24, 25), the mathematical tractability
of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in
ref. 24). However, all large empirical tests used random
sequences that may lack the subtle structure found within
biological sequences (26, 27) and obviously do not contain any
real homology. Thus, although many researchers have suggested
that statistical scores be used to rank matches (24, 25,
28), there have been no large rigorous experiments on biological
data to determine the degree to which such rankings are
superior.

A Database for Testing Homology Detection. Since the
discovery that the structures of hemoglobin and myoglobin are
very similar though their sequences are not (29), it has been
apparent that comparing structures is a more powerful (if less
convenient) way to recognize distant evolutionary relationships
than comparing sequences. If two proteins show a high
degree of similarity in their structural details and function, it
is very probable that they have an evolutionary relationship
though their sequence similarity may be low.

The recent growth of protein structure information combined
with the comprehensive evolutionary classification in
the SCOP database (4, 5) have allowed us to overcome previous
limitations. With these data, we can evaluate the performance
of sequence comparison methods on real protein sequences
whose relationships are known confidently. The SCOP database
uses structural information to recognize distant homologs, the
large majority of which can be determined unambiguously.
These superfamilies, such as the globins or the immunoglobulins,
would be recognized as related by the vast majority of the
biological community despite the lack of high sequence similarity.

From SCOP, we extracted the sequences of domains of
proteins in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) has domains, which were all <90%
identical to any other, whereas (PDB40D-B) had those <40%
identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The
highest quality domain was selected for inclusion in the
database and removed from the list. Also removed from the list
(and discarded) were all other domains above the threshold
level of identity to the selected domain. This process was
repeated until the list was empty. The PDB40D-B database
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ~0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988 relationships,
representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were
masked in both databases by processing with the SEG program
(27) using recommended parameters: 12 1.8 2.0. The databases
used in this paper are available from http://sss.stanford.edu/sss/,
and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
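
The selection procedure described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the `quality` and `identity` callables, and the toy data are all hypothetical, and how domain quality and pairwise identity were actually computed is not specified here.

```python
from typing import Callable, Hashable, Iterable, List


def select_nonredundant(
    domains: Iterable[Hashable],
    quality: Callable[[Hashable], float],
    identity: Callable[[Hashable, Hashable], float],
    threshold: float,
) -> List[Hashable]:
    """Greedy filter: keep the best remaining domain, drop its near-duplicates."""
    # Sort once so the highest-quality domain is always at the front of the list.
    remaining = sorted(domains, key=quality, reverse=True)
    selected = []
    while remaining:
        best = remaining.pop(0)
        selected.append(best)
        # Discard every other domain at or above the identity threshold to `best`.
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return selected


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n distinct domains."""
    return n * (n - 1)


# Hypothetical toy data: three domains with made-up quality and identity values.
toy_quality = {"a": 3.0, "b": 2.0, "c": 1.0}
toy_identity = {
    frozenset(("a", "b")): 0.95,
    frozenset(("a", "c")): 0.20,
    frozenset(("b", "c")): 0.30,
}
kept = select_nonredundant(
    toy_quality,
    toy_quality.get,
    lambda x, y: toy_identity[frozenset((x, y))],
    threshold=0.40,
)
print(kept)  # ['a', 'c']
```

As a consistency check on the figures in the text, `ordered_pairs(1323)` evaluates to 1,749,006, the total number of ordered pairs quoted for the 40% database.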

Analyses from both databases were generally consistent, but
PDB40D-B focuses on distantly related proteins and reduces the
heavy overrepresentation in the PDB of a small number of
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted otherwise,
the distant homolog results here are from PDB40D-B.
Although the precise numbers reported here are specific to the
structural domain databases used, we expect the trends to be
general.

Assessment Data and Procedure. Our assessment of sequence
comparison may be divided into four different major
categories of tests. First, using just a single sequence comparison
algorithm at a time, we evaluated the effectiveness of
different scoring schemes. Second, we assessed the reliability
of scoring procedures, including an evaluation of the validity
of statistical scoring. Third, we compared sequence comparison
algorithms (using the optimal scoring scheme) to determine
their relative performance. Fourth, we examined the
distribution of homologs and considered the power of pairwise
sequence comparison to recognize them. All of the analyses
used the databases of structurally identified homologs and a
new assessment criterion.

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix
(BLOSUM62) were used for BLAST and WU-BLAST2.

The "Coverage Vs. Error" Plot. To test a particular protocol
(comprising a program and scoring scheme), each sequence
from the database was used as a query to search the database.
This yielded ordered pairs of query and target sequences with
associated scores, which were sorted, on the basis of their
scores, from best to worst. The ideal method would have
perfect separation, with all of the homologs at the top of the
list and unrelated proteins below. In practice, perfect separation
is impossible to achieve, so instead one is interested in
drawing a threshold above which there are the largest number
of related pairs of sequences consistent with an acceptable
error rate.

Our procedure involved measuring the coverage and error
for every threshold. Coverage was defined as the fraction of
structurally determined homologs that have scores above the
selected threshold; this reflects the sensitivity of a method.
Errors per query (EPQ), an indicator of selectivity, is the
number of nonhomologous pairs above the threshold divided
by the number of queries. Graphs of these data, called
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of
Receiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of
nonhomologs.
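
The coverage and EPQ definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the (score, is_homolog) input format, and the handling of tied scores (pairs are simply taken in sorted order) are assumptions made for the example. Higher scores are assumed to be better, so E-values would need to be negated first.

```python
from typing import List, Tuple


def coverage_vs_epq(
    hits: List[Tuple[float, bool]],
    n_homolog_pairs: int,
    n_queries: int,
) -> List[Tuple[float, float]]:
    """Return one (EPQ, coverage) point per candidate threshold.

    hits            -- (score, is_homolog) for every reported query/target pair
    n_homolog_pairs -- number of structurally identified homologous pairs
    n_queries       -- number of query sequences searched
    """
    # Sort from best to worst score, as in the procedure described in the text.
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    points = []
    true_pairs = 0
    false_pairs = 0
    for _score, is_homolog in ranked:
        if is_homolog:
            true_pairs += 1
        else:
            false_pairs += 1
        # Placing the threshold just below this pair gives these two values.
        epq = false_pairs / n_queries
        coverage = true_pairs / n_homolog_pairs
        points.append((epq, coverage))
    return points


def coverage_at_epq(points: List[Tuple[float, float]], max_epq: float) -> float:
    """Largest coverage reachable without exceeding the given EPQ."""
    best = 0.0
    for epq, coverage in points:
        if epq <= max_epq:
            best = max(best, coverage)
    return best
```

Reading off the coverage at a fixed error level, for example `coverage_at_epq(points, 0.01)` for 1% EPQ, gives the kind of single-number summary quoted for each protocol in the text.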

This assessment procedure is directly relevant to practical
sequence database searching, for it provides precisely the
information necessary to perform a reliable sequence database
search. The EPQ measure places a premium on score
consistency; that is, it requires scores to be comparable for different
queries. Consistency is an aspect which has been largely
ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by
database searching programs, if the programs' estimates are
accurate.

Fig. 2. Unrelated proteins with high percentage identity.
Hemoglobin β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RASMOL (40).

Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B: Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the HSSP threshold (though it is intended to be applied
with a different matrix and parameters).

Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers, and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

The Performance of Scoring Schemes. All of the programs
tested could provide three fundamental types of scores. The
first score is the percentage identity, which may be computed
in several ways based on either the length of the alignment or
the lengths of the sequences. The second is a "raw" or
"Smith-Waterman" score, which is the measure optimized by
the Smith-Waterman algorithm and is computed by summing
the substitution matrix scores for each position in the
alignment and subtracting gap penalties. In BLAST, a measure
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results
are summarized in Fig. 1.
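
The first two score types can be illustrated with a short Python sketch. Several things here are assumptions for the example, not details taken from the paper: the function names, the two choices of identity denominator, the toy substitution function (a real search would use a matrix such as BLOSUM45), and the gap convention in which the first position of a gap costs `gap_open` and each further position costs `gap_extend`.

```python
GAP = "-"


def percent_identity(aligned_a: str, aligned_b: str, denominator: str = "alignment") -> float:
    """Percentage identity of two gapped, equal-length alignment rows.

    denominator="alignment" divides by all alignment columns (gaps included);
    denominator="shorter" divides by the length of the shorter ungapped sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    identical = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != GAP)
    if denominator == "alignment":
        denom = len(aligned_a)
    elif denominator == "shorter":
        denom = min(len(aligned_a.replace(GAP, "")), len(aligned_b.replace(GAP, "")))
    else:
        raise ValueError("unknown denominator: " + denominator)
    return 100.0 * identical / denom if denom else 0.0


def toy_substitution(a: str, b: str) -> int:
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 5 if a == b else -2


def raw_score(aligned_a, aligned_b, substitution=toy_substitution,
              gap_open=-12, gap_extend=-1) -> int:
    """Sum substitution scores over aligned columns and add (negative) gap penalties."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_side = None  # which row the current gap is in, if any
    for a, b in zip(aligned_a, aligned_b):
        if a == GAP and b == GAP:
            raise ValueError("column with gaps in both rows")
        if a == GAP or b == GAP:
            side = "a" if a == GAP else "b"
            # First position of a gap pays gap_open, later positions gap_extend.
            score += gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            score += substitution(a, b)
            gap_side = None
    return score
```

The same alignment can give noticeably different identity values depending on the denominator, which is one reason percentage identity thresholds are hard to compare between studies.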

Sequence Identity. Though it has been long established that
percentage identity is a poor measure (35), there is a common
rule-of-thumb stating that 30% identity signifies homology.
Moreover, publications have indicated that 25% identity can
be used as a threshold (17, 36). We find that these thresholds,
originally derived years ago, are not supported by present
results. As databases have grown, so have the possibilities for
chance alignments with high identity; thus, the reported cutoffs
lead to frequent errors. Fig. 2 shows one of the many pairs of
proteins with very different structures that nonetheless have
high levels of identity over considerable aligned regions.
Despite the high identity, the raw and the statistical scores for
such incorrect matches are typically not significant. The
principal reasons percentage identity does so poorly seem to be
that it ignores information about gaps and about the
conservative or radical nature of residue substitutions.

From the PDB90D-B analysis in Fig. 3, we learn that 30%
identity is a reliable threshold for this database only for
sequence alignments of at least 150 residues. Because one
unrelated pair of proteins has 43.5% identity over 62 residues,
it is probably necessary for alignments to be at least 70 residues
in length before 40% is a reasonable threshold, for a database
of this particular size and composition.

At a given reliability, scores based on percentage identity
detect just a fraction of the distant homologs found by
statistical scoring. If one measures the percentage identity in
the aligned regions without consideration of alignment length,
then a negligible number of distant homologs are detected.
Use of the HSSP equation improves the value of percentage
identity, but even this measure can find only 4% of all known
homologs at 1% EPQ. In short, percentage identity discards
most of the information measured in a sequence comparison.
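
A length-dependent identity cutoff of the HSSP kind can be sketched as follows. The text does not reproduce the equation, so the form used here, t(L) = 290.15 L^(-0.562) for alignments of 10 to 80 residues and a flat 24.8% above 80, is an assumption based on the commonly quoted HSSP curve and should be checked against the original HSSP publication. The function names are hypothetical.

```python
def hssp_threshold(alignment_length: int) -> float:
    """Percent identity cutoff for a given alignment length (assumed HSSP form)."""
    if alignment_length < 10:
        raise ValueError("the assumed curve is not defined below 10 residues")
    if alignment_length > 80:
        return 24.8
    return 290.15 * alignment_length ** -0.562


def above_hssp(percent_identity: float, alignment_length: int) -> bool:
    """True if an alignment lies on or above the assumed HSSP curve."""
    return percent_identity >= hssp_threshold(alignment_length)
```

Under these assumed constants, the unrelated pair cited above (43.5% identity over 62 residues) would still lie above the curve, which is consistent with the observation that a length-corrected identity cutoff remains much weaker than statistical scoring. The curve was intended for a different matrix and parameters, so this is only indicative.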

Raw Scores. Smith-Waterman raw scores perform better
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise
when using either raw or bit scores because a 20% change in
cutoff score could yield a tenfold difference in EPQ. However,
it is difficult to choose appropriate thresholds because the
reliability of a bit score depends on the lengths of the proteins
matched and the size of the database. Raw score thresholds
also are affected by matrix and gap parameters.
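
One way to see why a fixed bit-score cutoff is hard to choose is the standard Karlin-Altschul relation between a bit score S' and the expected number of chance hits, E = m n 2^(-S'), where m is the query length and n the total database length. The sketch below uses that textbook relation and ignores edge-effect corrections; it is not taken from the paper, and the function names are hypothetical.

```python
import math


def evalue_from_bits(bit_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance hits for a bit score in a given search space."""
    return query_length * database_length * 2.0 ** (-bit_score)


def bits_for_fold_change(fold: float) -> float:
    """Change in bit score that alters the expected hit count by the given factor."""
    return math.log2(fold)
```

Under this relation the same bit score is ten times less significant in a database ten times larger, and a shift of only about 3.3 bits changes the expected number of chance hits tenfold, which illustrates why small changes in a cutoff matter so much.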

Statistical Scores. Statistical scores were introduced partly
to overcome the problems that arise from raw scores. This
scoring scheme provides the best discrimination between
homologous proteins and those which are unrelated. Most
likely its power can be attributed to its incorporation of more
information than any other measure; it takes account of the
full substitution and gap data (like raw scores) but also has
details about the sequence lengths and composition and is
scaled appropriately.

Fig. 5. Coverage vs. error plots of different sequence comparison methods: Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships
at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.

We find that statistical scores are not only powerful, but also
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two
sequences being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted
in this way, and these results validate the suitability of the
extreme value distribution for describing the scores from a
database search.
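
The interpretation of an E-value as an expected error count, and its relation to a P-value, can be made concrete with a small sketch. It assumes the usual Poisson relation P = 1 - exp(-E) between the expected number of chance hits and the probability of at least one; the function names are hypothetical.

```python
import math


def p_value_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, given the expected number E."""
    # -expm1(-E) equals 1 - exp(-E) but keeps precision for very small E.
    return -math.expm1(-e_value)


def expected_false_positives(e_value: float, n_queries: int) -> float:
    """Chance hits expected at this E-value threshold over many queries."""
    return e_value * n_queries
```

For small values the two quantities are nearly identical (an E-value of 0.01 corresponds to a P-value just under 0.01), while for large E the P-value saturates at 1. An accurate E-value of 0.01 corresponds to about one false pair per 100 queries, which is why it can be compared directly with EPQ.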

The P-values from BLAST also should be directly
interpretable but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database.
Nonetheless, these results strongly suggest that the analytic theory is
fundamentally appropriate. WU-BLAST2 scores were more
reliable than those from BLAST, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.

Overall Detection of Homologs and Comparison of
Algorithms. The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even SSEARCH with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect
homologs. Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.

In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such
relationships have no more sequence identity than would be expected
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have
significant E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by SSEARCH E-values. However, although the
number of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity
are detected and only 10% of those with 15-20% can be found.
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power
of the method is restricted by the great divergence of many
protein sequences.

Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged
extremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.
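
The per-identity breakdown described above amounts to binning homologous pairs by percentage identity and computing the fraction detected in each bin. The sketch below is one hypothetical way to do that; the function name, the (identity, detected) input format, and the 5% bin width are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def detection_by_identity(
    pairs: Iterable[Tuple[float, bool]],
    bin_width: int = 5,
) -> Dict[Tuple[int, int], float]:
    """Fraction of homologous pairs detected within each identity bin.

    pairs -- (percent_identity, detected) for every known homologous pair,
             where `detected` means the pair scored above the chosen
             threshold (for example, the threshold giving 1% EPQ).
    """
    totals = defaultdict(int)
    found = defaultdict(int)
    for identity, detected in pairs:
        lower = int(identity // bin_width) * bin_width  # lower edge of the bin
        totals[lower] += 1
        if detected:
            found[lower] += 1
    return {
        (lower, lower + bin_width): found[lower] / totals[lower]
        for lower in sorted(totals)
    }
```

Comparing the total count per bin with the detected fraction reproduces the two quantities plotted in Fig. 6: how many homologs exist at each level of identity, and how many of them the search actually finds.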

After completion of this work, a new version of pairwise
BLAST was released: BLASTGP (37). It supports gapped
alignments, like WU-BLAST2, and dispenses with sum statistics. Our
initial tests on BLASTGP using default parameters show that its
E-values are reliable and that its overall detection of homologs
was substantially better than that of ungapped BLAST, but not
quite equal to that of WU-BLAST2.

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27
and references therein) suggests that the most effective
sequence searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view.

Our results also suggest two further points. First, the
E-values reported by FASTA and SSEARCH give fairly accurate
estimates of the significance of each match, but the P-values
provided by BLAST and WU-BLAST2 underestimate the true
extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

The homologous proteins that are found by sequence
comparison can be distinguished with high reliability from the huge
number of unrelated pairs. However, even the best database
searching procedures tested fail to find the large majority of
distant evolutionary relationships at an acceptable error rate.
Thus, if the procedures assessed here fail to find a reliable
match, it does not imply that the sequence is unique; rather, it
indicates that any relatives it might have are distant ones.**

**Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.